Data visualization with R

Data visualization has become essential for any application that involves data. There are at least two ways in which being able to visualize data effectively can help us:

  1. When we explore the data: it is much easier to spot a pattern in a graph than in a table of numbers
  2. When we report the data: nice graphs catch the eye, and convey our message much more effectively than raw numbers

R is pretty good at this. It has a solid graphics engine that allows us to make some very good-looking plots (compare them with Excel, Stata or SPSS plots and you will see what I mean). Moreover, the package ggplot2, first developed in the mid-2000s and since then become one of the most widely used tools to visualize data, makes it very easy to produce nice-looking charts that convey complex information in a relatively simple way.

Today we will do several things. We will play around with the plotting functions available in base R. Then we’ll start using ggplot2. In both cases, as usual, we are going to see some examples; keep in mind, however, that the possibilities for visualizing data go well beyond what we cover here. Finally, we will see an application to topic models.

First of all, let’s load the data. To begin, we are going to use two datasets:

# In case you do not have `rio` installed:
# install.packages("rio", dependencies = T)

library(rio)
setwd("/folder/where/you/keep/the/data")
# Import the CSES data 
cses <- import("cses2018itaRAW.xlsx")
# Import the V-Dem data
vdem <- import("vdem_small.dta")
head(vdem)
##   country year liberal free_exp gdp_pc_log postcom
## 1 Austria 1990   0.916    0.976   10.09307       0
## 2 Austria 1991   0.916    0.976   10.14471       0
## 3 Austria 1992   0.916    0.976   10.17072       0
## 4 Austria 1993   0.916    0.976   10.18203       0
## 5 Austria 1994   0.916    0.976   10.23265       0
## 6 Austria 1995   0.912    0.976   10.27291       0

The V-Dem dataset is useful because it introduces a time dimension that we might encounter in our data from time to time. The same 28 countries are observed at up to 29 points in time (from 1990 to 2018). So we have two dimensions of variation: (1) the between-country variation, and (2) the within-country, over-time variation. The dataset contains the following variables: country, year, liberal (the liberal democracy index), free_exp (the freedom of expression index), gdp_pc_log (logged GDP per capita) and postcom (a dummy for post-communist countries).

From the CSES data we will mainly use the variables area (the geographical area of the respondent) and eco_eval (the respondent’s evaluation of the economy over the last year).

Visualization with base R

Even using base R, there are a lot of things that we can do to visualize data. However, many of them can be done much more effectively with ggplot2. So here we will just look at some basic functions that we can use for a quick-and-dirty inspection of the data.

hist(vdem$gdp_pc_log)

Note that we can change a few parameters of the chart. For instance, in the chart below we color the bars blue, add a label to the x-axis, and remove the main title.

hist(vdem$gdp_pc_log, col = "blue", 
     xlab = "Logged GDP per capita", main = "")

boxplot(vdem$gdp_pc_log)

Boxplots can also show the relationship between two variables. For instance, we may want to look at the distribution of the variable gdp_pc_log for each year in which the countries are observed in the data. In this case we pass a formula written as y ~ x, where y is the variable whose distribution we want to see (in our case gdp_pc_log) and x is the grouping factor (in our case year). We can also make the plot horizontal.

boxplot(gdp_pc_log ~ year, data = vdem, 
        horizontal = T,                 # Make a horizontal boxplot
        las = 2)                        # Set the direction of the value labels on the axes

We could also use country as a grouping factor.

boxplot(gdp_pc_log ~ country, data = vdem, 
        horizontal = T,
        las = 2)

However, this confronts us with two problems:

  1. The country names run off the margins of the figure. This is not good, for obvious reasons.
  2. It is difficult to make sense of this plot: since the countries are displayed in alphabetical order, the boxplots are all over the place.

Let’s start with the first issue. In R, we can change the margins of a figure by tweaking the default graphical parameters that R uses to produce any chart in the current session. We can see them (and change them) using the function par(). The parameter we are interested in controls the figure margins and is called mar. We can see the default values by typing the following:

par()$mar
## [1] 5.1 4.1 4.1 2.1

Now, the values refer to the distance between the margins of the figure and the margins of the plot area. The order in which they are saved is bottom, left, top, right. As you can see, there is a bit more space at the bottom (to accommodate the x-axis label and the value labels), on the left (to accommodate the y-axis label and the value labels), and on top (to accommodate the main title).
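As a quick self-contained check of this ordering (a toy sketch; it assumes a graphics device can be opened), we can widen a single margin and read it back:

```r
old <- par(no.readonly = TRUE)   # save the modifiable parameters
par(mar = c(5, 8, 4, 2))         # widen the left margin (the second value)
par("mar")[2]                    # 8
par(old)                         # restore the defaults
```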

In our case, we need more space on the left, so we can change the values accordingly. However, the values would be changed for the entire session, so we had better save the default values and restore them once the plot is done.

# Save the default graphical parameters (only the modifiable ones,
# to avoid warnings when restoring them later)
pdefault <- par(no.readonly = TRUE)
# Change the margins to have more space on the left
par(mar = c(5, 8, 4, 2))
# Make the boxplot
boxplot(gdp_pc_log ~ country, data = vdem, 
        horizontal = T,
        las = 2,
        ylab = "")

# Restore the default parameters
par(pdefault)

The second issue is less technical and more substantive. The point of making a boxplot by country is to compare countries with respect to their GDP per capita. However, if the countries are sorted in alphabetical order, the comparison is not as easy as it could be. The information is all there, but it is not organized in a way that makes the relevant message emerge: which countries have a systematically higher GDP per capita, and which ones a systematically lower GDP per capita, in the period observed. One way to make this piece of information emerge is to sort the countries by their median GDP per capita. This way we can see the pattern of between-country variation much more clearly.

Excursus: factor variables, sorting and plotting

To do so, we need to create a new factor variable for the country, with the levels sorted by the median of gdp_pc_log within each country. To understand how this works, we need to dig a bit into the properties of factor variables in R.

Let’s work through an example by creating a fake data frame with a nominal variable var.

fda <- data.frame(var = c("a", "b", "c"))

In older versions of R (before 4.0), data.frame() converted character variables to factors by default; since R 4.0 they are kept as character. We can ask R explicitly whether var is a factor:

is.factor(fda$var)
## [1] FALSE

Since it is not, let’s convert it ourselves:

fda$var <- factor(fda$var)

Factor variables have a peculiar property: they are labeled integer variables, i.e. their true values are integers, but each integer is a placeholder that makes sense only when associated with a label (in our case, the letters “a”, “b” and “c”). The labels associated with the unique values of a factor variable are called “levels”. We can see the levels of var with the function levels():

levels(fda$var)
## [1] "a" "b" "c"

When R uses a factor variable as a grouping factor, by default it follows the alphabetical order of the levels. This can be seen by making a table of var:

table(fda$var)
## 
## a b c 
## 1 1 1
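The “labeled integer” nature of factors is easy to verify on a small toy vector:

```r
f <- factor(c("b", "a", "c", "a"))
levels(f)       # "a" "b" "c": levels are sorted alphabetically by default
as.integer(f)   # 2 1 3 1: the underlying integer codes behind the labels
```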

However, we can change the order in which the levels of the factor appear in the output (in this case, the table) by changing the levels attribute of the variable itself. For instance, if we want to change the order from c("a", "b", "c") to c("b", "c", "a"), we can do the following:

fda$var <- factor(fda$var, levels = c("b", "c", "a"))

Now if we ask for a table of var, the values will be displayed in the order that we set:

table(fda$var)
## 
## b c a 
## 1 1 1

We can take advantage of this property when we want to order nominal categories in a plot. In our case, we need to order the levels of the variable country by the median of gdp_pc_log. We can do this automatically using the function reorder(). We do it by creating a new variable called country_sort.

vdem$country_sort <- with(vdem, reorder(country, gdp_pc_log, function(x) median(x, na.rm = T)))
# Let's make a table of the first 100 observations in the data to see whether it worked
with(vdem[1:100, ], table(country_sort, country))
##                 country
## country_sort     Austria Belgium Bulgaria Cyprus
##   Romania              0       0        0      0
##   Bulgaria             0       0       29      0
##   Latvia               0       0        0      0
##   Lithuania            0       0        0      0
##   Poland               0       0        0      0
##   Estonia              0       0        0      0
##   Croatia              0       0        0      0
##   Slovakia             0       0        0      0
##   Hungary              0       0        0      0
##   Czech Republic       0       0        0      0
##   Portugal             0       0        0      0
##   Malta                0       0        0      0
##   Slovenia             0       0        0      0
##   Greece               0       0        0      0
##   Cyprus               0       0        0     13
##   Spain                0       0        0      0
##   France               0       0        0      0
##   Italy                0       0        0      0
##   Finland              0       0        0      0
##   Belgium              0      29        0      0
##   Germany              0       0        0      0
##   United Kingdom       0       0        0      0
##   Austria             29       0        0      0
##   Denmark              0       0        0      0
##   Sweden               0       0        0      0
##   Netherlands          0       0        0      0
##   Ireland              0       0        0      0
##   Luxembourg           0       0        0      0
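Since reorder() does the heavy lifting here, this is a minimal self-contained illustration of its logic on made-up data (the names df, grp and val are invented for the example):

```r
df <- data.frame(
  grp = c("a", "a", "b", "b", "c", "c"),
  val = c(5, 7, 1, 2, 9, 11)
)
# Sort the levels of grp by the within-group median of val
df$grp_sort <- reorder(df$grp, df$val, median)
levels(df$grp_sort)   # "b" "a" "c" (medians: 1.5, 6, 10)
```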

Now we can make our boxplot using the (ordered) variable country_sort instead of the (unordered) variable country.

par(mar = c(5, 8, 4, 2))
boxplot(gdp_pc_log ~ country_sort, data = vdem, 
        horizontal = T,
        las = 2,
        ylab = "")

par(pdefault)

This way, the plot is much more informative than with the unordered countries.

Scatter plots are the standard way to show the bivariate relationship between two continuous variables. In base R they are made with the function plot(). Since we would otherwise have to write the name of the dataset twice, once before each variable with the dollar sign $, we can use the function with() to specify up front the dataset from which both variables are taken.

with(vdem, plot(x = liberal, y = gdp_pc_log))

Which can also be written using the formula (with y being gdp_pc_log and x being liberal):

# This will produce the same plot as above
plot(gdp_pc_log ~ liberal, data = vdem)

These were just a few examples; we can actually make a lot of different charts with base R. But the real deal comes now.

Visualization with ggplot2

While base R allows us to do many things, some packages make it possible to visualize very complex information in a relatively easy way. The most widely used is surely ggplot2. Like many other packages we have seen these days, ggplot2 is part of the tidyverse. This means that it is very well integrated with other tidyverse routines, most notably dplyr. To take advantage of this integration, instead of loading only ggplot2, we will load the full tidyverse collection.

# In case you do not have `tidyverse` installed:
# install.packages("tidyverse", dependencies = T)

library(tidyverse)

The syntax of ggplot2 works in a slightly different way from that of base R. In a way, it is closer to the logic of piped expressions in dplyr (although it is not the same thing). The main function to produce a chart is called (unsurprisingly) ggplot(). The information the function needs is pretty much the same we would have to pass to a base R function: the data containing the variables that we want to plot, and the variables themselves. The latter go into a specific subfunction called aes().

Note that the function ggplot() is used to produce any chart we make with ggplot2. This clearly implies that we need to add, somewhere, the information about what kind of plot we want to make (e.g. a scatter plot, a box plot, etc.). In ggplot2 this is done by adding layers to the plot. Layers are pieces of information that we integrate into the existing plot. They may contain more data, or just the geometric pattern that we want to display (which determines the plot type). In fact, without layers there is no chart in ggplot2: if we try to make a plot without layers, nothing will come out.

ggplot(data = vdem, aes(x = liberal, y = gdp_pc_log))

Layers in ggplot2 are added using the set of functions geom_*. For instance, if we want to add points to the plot (therefore making a scatter plot), we add the layer geom_point(). In ggplot2 syntax, however, this is not done within the ggplot() function: the layer is chained after the initial call by literally adding it with the plus (+) sign.

ggplot(vdem, aes(x = liberal, y = gdp_pc_log)) +
  geom_point()

Layers are used for everything, from adding axis labels to plotting more data. Inside a layer we can change the options for that specific geometry, but we can also specify the source data and the variables to plot directly in the layer. For instance, the very same chart above could also be produced by putting all the information within the geom_point() function:

# Same as above, run to see
ggplot() +
  geom_point(data = vdem, aes(x = liberal, y = gdp_pc_log))

A few examples:

ggplot(vdem, aes(x = gdp_pc_log)) +
  geom_histogram()

ggplot(vdem, aes(x = gdp_pc_log)) +
  geom_histogram(alpha = 0.2, col = "black", fill = "blue") +
  xlab("Log GDP per capita") +
  ylab("Count") +
  theme_bw()

ggplot(vdem, aes(x = reorder(country, gdp_pc_log, function(x) mean(x, na.rm = T)), y = gdp_pc_log)) +
  geom_violin(fill = "gray50", trim = F, scale = "width") +
  geom_boxplot(fill = "white", width = 0.2) +
  coord_flip() +
  ylab("Log GDP per capita") +
  xlab("Country") +
  theme_bw()

Multivariate plots

The vdem data allows us to view time trends of some specific variables. For instance, we can see how the “liberal democracy index” changed in Italy over time. This would be a bivariate plot, as we are looking at the relationship between one phenomenon (time) and another (liberal democracy in a country).

ggplot(filter(vdem, country == "Italy"), aes(y = liberal, x = year)) +
  geom_line() +
  geom_point() +
  ylab("Liberal democracy index") +
  xlab("Year") +
  scale_x_continuous(breaks = seq(1990, 2018, by = 2)) +
  theme_bw()

We can add a third variable (effectively showing a multivariate relationship) by breaking down the relationship between two variables country by country. We can do it by adding a grouping factor and using the colors to identify different countries:

ggplot(vdem, aes(y = gdp_pc_log, x = year, group = country)) +
  geom_line(aes(col = country)) +
  ylab("Log GDP per capita") +
  xlab("Year") +
  scale_x_continuous(breaks = seq(1990, 2016, by = 2)) +
  theme_bw()

However, with 28 countries it gets very hard to understand anything.

Alternatively, we can show the pattern country by country by faceting the plot with facet_wrap():

ggplot(vdem, aes(x = year, y = gdp_pc_log)) +
  geom_line() +
  facet_wrap(~country, ncol = 5) +
  ylab("Log GDP per capita") +
  xlab("Year") +
  scale_x_continuous(breaks = seq(1990, 2020, by = 5)) +
  theme_bw()

Regression lines and smoothers

Going back to the trend of liberal over time in Italy: next to (or rather than) plotting the individual points connected by a line, we can add a smoothing line to make the pattern emerge more clearly. This is one step toward fitting a regression model, and it is done with the function geom_smooth(). If we do not specify anything, the function fits a non-parametric curve to interpolate the points: by default a loess (with few observations, as in our case) or a GAM (with more observations).

ggplot(filter(vdem, country == "Italy"), aes(y = liberal, x = year)) +
  geom_point() +
  geom_smooth() +
  ylab("Liberal democracy index") +
  xlab("Year") +
  scale_x_continuous(breaks = seq(1990, 2018, by = 2)) +
  theme_bw()
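For intuition, the non-parametric smoother that geom_smooth() defaults to with small samples can be reproduced directly with base R’s stats::loess; a minimal sketch on made-up data:

```r
set.seed(1)
x <- 1:30
y <- sin(x / 5) + rnorm(30, sd = 0.2)   # a noisy non-linear trend
fit <- loess(y ~ x)                     # local polynomial regression
head(fitted(fit))                       # the smoothed values the line would pass through
```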

Of course, this can be done by group as well:

ggplot(vdem, aes(y = gdp_pc_log, x = year, group = country)) +
  geom_smooth(aes(col = country), se = F) +
  ylab("Log GDP per capita") +
  xlab("Year") +
  scale_x_continuous(breaks = seq(1990, 2016, by = 2)) +
  theme_bw()

geom_smooth() can also be used to fit a regression line in the plot by adding the option method = "lm". For instance, we can look at the linear relationship between the “Freedom of expression” index (the variable free_exp) and the “Liberal democracy” index.

ggplot(vdem, aes(x = free_exp, y = liberal)) +
  geom_point() +
  geom_smooth(method = "lm", col = "red") +
  ylab("Liberal democracy index") +
  xlab("Freedom of expression index") +
  theme_bw()

Combining ggplot2 with dplyr

Being part of the tidyverse, ggplot2 is very well integrated with the other packages in the ecosystem. This allows for instance for a very easy concatenation of dplyr functions (to modify or summarize the data) and ggplot2 functions (to visualize them).

For instance, we may want to see whether, since 1990, there has been a convergence in terms of liberal democracy between Western-European countries and the post-communist countries of the Central-Eastern European bloc. To do so, we first need to summarize the variable liberal by group (using the dummy postcom as grouping factor) and then plot the two groups of countries against each other. Using the pipe operator %>% we can chain all the data manipulation operations and the plot directly. We only need to remember to switch from the pipe (for the dplyr part) to the plus + sign (for the ggplot2 part).

vdem %>%                                                                  # Data manipulation
  group_by(year, postcom) %>%
  summarize(libm = mean(liberal, na.rm = T)) %>%
  mutate(
    gr = ifelse(postcom == 0, "West\nEurope", "Central-Eastern\nEurope")
  ) %>%
  ggplot(., aes(x = year, y = libm, group = gr)) +                        # Plotting
  geom_smooth(aes(col = gr)) +
  scale_color_manual(values = c("red", "blue"), "Geo-political area") +
  scale_x_continuous(breaks = seq(1990, 2018, by = 2)) +
  xlab("") +
  ylab("Liberal democracy index") +
  theme_bw() +
  theme(legend.position = "bottom")

Another example: using the cses data, we want to see the share of respondents in each geographical area by their evaluation of the economy over the year before the (2018) elections. We can visualize this with a bar plot (as we saw before), in two variants: “stacked” or “dodged”. We are going to check them both out.

# Data wrangling
ecodat <- cses %>%
  filter(!is.na(eco_eval)) %>%
  group_by(area, eco_eval) %>%
  summarize(nobs = n()) %>%
  group_by(area) %>%
  mutate(
    share = nobs / sum(nobs),
    gr = ifelse(eco_eval == -1, "Got worse",
                ifelse(eco_eval == 0, "Remained\nthe same",
                       "Got better"))
    ) %>%
  ungroup() %>%
  mutate(
    area = factor(area, levels = rev(c("Nord-Ovest", "Nord-Est", "Centro", "Sud"))),
    gr = factor(gr, levels = c("Got worse", "Remained\nthe same", "Got better"))
  )

# Stacked bar chart
ggplot(ecodat, aes(x = area, y = share, fill = gr)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = c("gray20", "gray50", "gray80"), 
                    "Evaluation of the economy\nover the last year") +
  scale_y_continuous(breaks = seq(0, 1, by = 0.2),
                     labels = scales::percent) +
  ylab("Share of respondents") +
  xlab("Geographical area") +
  ggtitle("Stacked bar chart") +
  theme_bw()

# Dodged bar chart
ggplot(ecodat, aes(x = area, y = share, fill = gr)) +
  geom_bar(stat = "identity", position = "dodge", col = "black") +
  scale_fill_manual(values = c("gray20", "gray50", "gray80"), 
                    "Evaluation of the economy\nover the last year") +
  scale_y_continuous(breaks = seq(0, 1, by = 0.2),
                     labels = scales::percent) +
  ylab("Share of respondents") +
  xlab("Geographical area") +
  ggtitle("Dodged bar chart") +
  theme_bw() +
  theme(legend.position = "bottom")

Exporting

Once we have the plots, we may want to save them, so we can put them in our paper, report, website, or whatever we need the plots for. In RStudio we can simply use the “Export” button over the bottom-right panel, but we might want to export the plot directly from the script, so we can automate the process (for instance, if we have to make many plots) and have more control over parameters such as the size of the figure.

In ggplot2, we can do this using the function ggsave(). We can do this in two ways:

  1. Put the plot into an object and apply ggsave() to that object.
# Save the plot in an object
plot1 <- ggplot(ecodat, aes(x = area, y = share, fill = gr)) +
  geom_bar(stat = "identity", position = "dodge", col = "black") +
  scale_fill_manual(values = c("gray20", "gray50", "gray80"), 
                    "Evaluation of the economy\nover the last year") +
  scale_y_continuous(breaks = seq(0, 1, by = 0.2),
                     labels = scales::percent) +
  ylab("Share of respondents") +
  xlab("Geographical area") +
  theme_bw() +
  theme(legend.position = "bottom")

# Save the plot
ggsave(plot1, filename = "eco_barplot.pdf",
         device = cairo_pdf, width = 6, height = 5)

Note: if we put a plot into an object and just save it, of course we won’t see the plot on screen. To display it, we need to call the object explicitly.
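A minimal sketch of this behavior, using the built-in mtcars data:

```r
library(ggplot2)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +   # stored in an object, nothing is drawn yet
  geom_point()
p          # evaluating the object auto-prints, i.e. displays the plot
print(p)   # equivalent; needed inside loops, functions and scripts
```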

  2. Another way to do this is to chain the function ggsave() directly after making the plot using the pipe %>%. We’ll need to wrap the group of functions making the plot in an additional pair of parentheses. The code below will also not display the plot.
(ggplot(ecodat, aes(x = area, y = share, fill = gr)) +
  geom_bar(stat = "identity", position = "dodge", col = "black") +
  scale_fill_manual(values = c("gray20", "gray50", "gray80"), 
                    "Evaluation of the economy\nover the last year") +
  scale_y_continuous(breaks = seq(0, 1, by = 0.2),
                     labels = scales::percent) +
  ylab("Share of respondents") +
  xlab("Geographical area") +
  theme_bw() +
  theme(legend.position = "bottom")) %>%
  ggsave(., filename = "eco_barplot.pdf",
         device = cairo_pdf, width = 6, height = 5)

Obviously, by the same logic, we can also pipe everything together: data preparation, visualization and exporting. Below is an example of a scatter plot with labels and grid faceting (as above, this operation will not display the plot, since it is saved directly to file).

By the way, when we put labels on points, they might overlap, and this is bad. Let’s use the package ggrepel to arrange them automatically so that they do not overlap. In this case, however, there are too many labels, and ggrepel will drop some of them (returning warning messages).

# Remember to install the package
# install.packages("ggrepel")

library(ggrepel)

(vdem %>%
  mutate(
    decade = ifelse(year < 2000, "1990s",
                    ifelse(year < 2010, "2000s", "2010s")),
    gr = ifelse(postcom == 0, "West\nEurope", "Central-Eastern\nEurope"),
    gr = factor(gr, levels = c("West\nEurope", "Central-Eastern\nEurope"))
  ) %>%
  group_by(country, decade) %>%
  summarize(
    gr = unique(gr),
    liberal = mean(liberal, na.rm = T),
    gdp = mean(gdp_pc_log, na.rm = T)
  ) %>%
  ggplot(., aes(x = gdp, y = liberal)) +
  geom_smooth(method = "lm", se = F, col = "blue") +
  geom_point() +
  geom_text_repel(aes(label = country)) +      # The function from ggrepel()
  facet_grid(gr ~ decade) +
  xlab("Log GDP per capita") +
  ylab("Liberal democracy index") +
  theme_bw()) %>%
  ggsave(., filename = "scatter_gdp_liberal.pdf",
         device = cairo_pdf, width = 10, height = 7)
## Warning: ggrepel: 7 unlabeled data points (too many overlaps). Consider increasing max.overlaps
## Warning: ggrepel: 10 unlabeled data points (too many overlaps). Consider increasing max.overlaps
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider increasing max.overlaps
## Warning: ggrepel: 10 unlabeled data points (too many overlaps). Consider increasing max.overlaps
## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider increasing max.overlaps
## Warning: ggrepel: 7 unlabeled data points (too many overlaps). Consider increasing max.overlaps
## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider increasing max.overlaps

Visualizing topic models

Data visualization can also help us with our text analysis tasks. One domain where visual diagnostics are quite useful is topic modeling. Visualization helps at several moments: (1) when we need to select the number of topics that the model is supposed to produce, (2) when we want to visualize the top words in every topic, and (3) when we want to see the distribution of the topics in the corpus.

We will work through an easy example using the dataset called FB_posts.xlsx, which is a sample of Facebook posts from the official pages of 6 UK parties (Conservatives, Labour, LibDems, Brexit Party, Greens and UKIP) from August 2019 to October 2020. The texts have been cleaned a bit already, so we will only need to do limited pre-processing (some standard operations).

Let’s start by installing the packages that we need: quanteda (to make the corpus object, the tokens and so on) and its extension quanteda.textstats, tidytext (another package for text management), and stm to fit structural topic models. The routine below will install the packages we are missing and load them all automatically.

want = c("quanteda", "quanteda.textstats", "tidytext", "stm")
have = want %in% rownames(installed.packages())
if ( any(!have) ) { install.packages( want[!have] , dependencies = T) }
junk <- lapply(want, library, character.only = T)
## Package version: 3.2.3
## Unicode version: 14.0
## ICU version: 70.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:rio':
## 
##     convert
## stm v1.3.6 successfully loaded. See ?stm for help. 
##  Papers, resources, and other materials at structuraltopicmodel.com
rm(have, want, junk)

We can now load the data and prepare it for the analysis.

# Import the data
fb_data <- import("FB_posts.xlsx")

# Make the corpus with "quanteda"
corp_uk <- corpus(fb_data, text_field = "message")

# Make tokens, further cleaning the text
tok_uk <- corp_uk %>%
  tokens(remove_punct = T,
         remove_symbols = T,
         remove_numbers = T) %>%
  tokens_remove(stopwords("en"))

# Look for relevant n-grams
ngr <- textstat_collocations(
  tok_uk, 
  size = 2:3,
  min_count = 10)

head(ngr)
##     collocation count count_nested length   lambda        z
## 1 boris johnson   104          104      2 8.630141 22.09312
## 2   stop brexit    49           49      2 4.763997 21.51931
## 3   green party    33           33      2 5.131901 20.20941
## 4       join us    29           29      2 4.890953 19.02816
## 5    get brexit    39           39      2 3.997924 18.84504
## 6   brexit done    42           42      2 5.541543 18.83167
# Combine the tokens to get the n-grams that we found
tok_uk <- tokens_compound(
  tok_uk, 
  phrase(ngr$collocation), 
  join = T
)

# Make Document-Feature Matrix, and trim it (select only very frequent words that discriminate well between texts)
dfm_uk <- dfm(tok_uk) %>% 
  dfm_trim(
    min_termfreq = 0.9, termfreq_type = "quantile",
    max_docfreq = 0.1, docfreq_type = "prop",
    verbose = T
  )
## Removing features occurring:
##   - fewer than 9 times: 3,922
##   - in more than 100 documents: 4
##   Total features removed: 3,926 (89.7%).
# Remove the documents which have less than 3 words left
dfm_uk <- dfm_uk[rowSums(dfm_uk) > 2,]
dim(dfm_uk)
## [1] 895 452

stm requires the data in a slightly different format than a quanteda DFM, so we need to convert the object to a different type.

dfm_uk <- convert(dfm_uk, to = "stm")

Finding the “best” number of topics

stm has a nice function to help us find the right number of topics, called searchK(). We specify a range of topic numbers, and stm will fit a set of topic models in that range and return some fit statistics that we can use to assess the best number of topics given the data (this is a brute-force quantitative approach that does not substitute for actually reading the topics produced by different solutions and figuring out which one makes more sense).

Note: an advantage of structural topic models is that they allow us to include predictors in the estimation, both to improve the quality of the estimates themselves and to get some substantive insights. In this case, we will include in the model the variable pt_week_y, a string that concatenates the year, month and week of the post. This includes in the estimation (in an extremely overfitting manner) some information about when the post was published, which should correlate with the probability that the post is about some specific topic (e.g. issues of the day, but also self-promotion by parties). We will use this variable only to improve the estimation; we will not inspect it substantively.

Watch out when running the code below: it takes some time.

set.seed(1234)
stm_many <- searchK(
  dfm_uk$documents, 
  dfm_uk$vocab, 
  K = seq(10, 200, by = 10),
  prevalence =~ factor(pt_week_y),
  data = dfm_uk$meta, 
  cores = 4
)
stm_many$results
##      K   exclus    semcoh   heldout   residual     bound    lbound em.its
## 1   10 9.546842 -146.8258 -5.807072   1.386329 -48047.68 -48032.58    132
## 2   20 9.739671 -149.6311 -5.854532   1.301555 -46071.94  -46029.6     88
## 3   30 9.803465 -158.5489 -5.976493   1.319745 -45065.54 -44990.89     62
## 4   40 9.824982 -153.7451 -6.048548   1.465606 -44028.88 -43918.56     89
## 5   50 9.839221 -150.8483 -6.196674   1.794245 -43363.32 -43214.84     82
## 6   60 9.835363 -154.3851 -6.148196    2.55257 -42679.77 -42491.14     90
## 7   70 9.840677 -154.4908 -6.275768   4.716161 -42520.08 -42289.64     74
## 8   80 9.850034 -156.1086  -6.25317    32.8066 -42223.62 -41949.95     76
## 9   90 9.857129 -157.3209  -6.19464  -4.261304 -41943.56 -41625.41     77
## 10 100 9.865058 -152.1958 -6.149852  -1.926544 -41624.61 -41260.87     82
## 11 110 9.866419 -156.0532 -6.177197     -1.302 -41513.36 -41103.03     85
## 12 120  9.86994 -157.5743 -6.341506 -0.9103978 -41259.93 -40802.12     89
## 13 130 9.869962 -155.1936 -6.302026 -0.6663003 -41069.04 -40562.91     83
## 14 140 9.875517 -156.5658 -6.228148 -0.5466153 -41025.68 -40470.46     90
## 15 150 9.880368  -156.197 -6.231056 -0.4391988 -40891.74 -40286.72     83
## 16 160 9.883717  -155.282 -6.276008 -0.3654539 -40851.47 -40195.99     86
## 17 170 9.885801 -151.2771 -6.180262 -0.3126482 -40686.17 -39979.59     88
## 18 180  9.88696 -154.4574  -6.34932 -0.2616176 -40662.94 -39904.69     66
## 19 190 9.889069 -153.5413 -6.352825 -0.2362103 -40462.15 -39651.67     87
## 20 200 9.891776 -154.0574 -6.333194 -0.2139281 -40434.24 -39571.01     90

We can now extract the results and plot some metrics: semantic coherence, exclusivity, held-out likelihood and the model residuals (the first three to be maximized, the last to be minimized).

stm_many$results <- stm_many$results %>% 
  mutate(
    across(everything(), ~as.numeric(.))
  )

# Let us make a plot with the most relevant metrics
stm_many$results %>% 
  select(K, exclus, semcoh, heldout, residual) %>% 
  pivot_longer(
    cols = -K
  ) %>% 
  mutate(
    name = recode(
      name,
      "exclus" = "1. Exclusivity",
      "semcoh" = "3. Semantic Coherence",
      "heldout" = "2. Held-out Likelihood",
      "residual" = "4. Residuals"
    )
  ) %>% 
  ggplot(aes(x = K, y = value)) +
  geom_line(aes(col = name)) +
  scale_color_discrete(guide = "none") +
  facet_wrap(~name, ncol = 2, scales = "free_y") +
  scale_x_continuous(breaks = seq(0, 200, by = 10)) +
  theme_bw()

Hard to say; however, there do not seem to be very many topics in here, certainly not more than 100. Let us look at the trade-off between semantic coherence and exclusivity (the most meaningful metrics that we have here).

ggplot(stm_many$results, aes(x = semcoh, y = exclus)) +
  geom_smooth(se = F) +
  geom_point() + 
  geom_text_repel(aes(label = K)) +
  xlab("Semantic coherence") + ylab("Exclusivity") +
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

It looks like a good solution might be 50 topics. So let’s fit a topic model (still including the variable pt_week_y as predictor).

stm_fit <- stm(
  dfm_uk$documents, 
  dfm_uk$vocab, 
  prevalence =~ factor(pt_week_y),
  K = 50, 
  data = dfm_uk$meta, 
  seed = 1234,
  verbose = F)

Explore topics’ content and prevalence

We can now have a look at (1) the most important words in each topic, to understand the “content”, or meaning, of the topics, and (2) how prevalent each topic is in the corpus. For the first piece of information, we need to extract the word probabilities for each topic, which in stm parlance are the beta coefficients associated with each word. For the second, we need to extract the prevalence of each topic in the corpus: the gamma coefficients associated with each topic.

# Extract word probabilities for each topic (that `stm` calls "beta")
beta_stm <- tidytext::tidy(stm_fit, matrix = "beta") 

# Extract topic probabilities in the corpus (that `stm` calls "gamma")
gamma_stm <- tidytext::tidy(stm_fit, matrix = "gamma")

# Clean "beta" data
top_terms <- beta_stm %>%
  arrange(beta) %>%
  group_by(topic) %>%
  slice_max(beta, n = 7) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarize(terms = list(term)) %>%
  mutate(terms = sapply(terms, paste, collapse = ", "))

# Clean "gamma" data
gamma_terms <- gamma_stm %>%
  group_by(topic) %>%
  summarize(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))

# Let's put them together and make a barplot
gamma_terms %>%
  top_n(20, gamma) %>%
  ggplot(aes(x = topic, y = gamma, label = terms, fill = topic)) +
  geom_col(show.legend = FALSE) +
  geom_text(hjust = 0, nudge_y = 0.0005, size = 3) +
  coord_flip() +
  scale_y_continuous(expand = c(0,0),
                     limits = c(0, 0.09),
                     labels = scales::percent_format(1L)) +
  theme_bw() +
  xlab("") + ylab(expression(gamma))

And now it’s our turn to interpret the results.